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In the advent of new large galaxy surveys, which will produce enormous datasets with hundreds of 
millions of objects, new computational techniques are necessary in order to extract from them any 
two-point statistic, the computational time of which grows with the square of the number of objects 
to be correlated. Fortunately technology now provides multiple means to massively parallelize this 
problem. Here we present a free-source code specifically designed for this kind of calculations. 
Two implementations are provided: one for execution on shared-memory machines using OpenMP 
and one that runs on graphical processing units (CPUs) using CUDA. The code is available at 
http : / /members . if t . ua m-csic.es/ dmonge/ CUTE . html , 

PACS numbers: 98.80.-k IFT-UAM/CSlC-12-31 



I. INTRODUCTION 



The last decades have seen an increasing interest in 
galaxy surveys as a means of studying the late-time evo- 
lution of the Universe. Forthcoming galaxy surveys, such 
as DBS [U, BigBOSS [2] or Euclid 0], will map large re- 
gions of the sky (0(10^"'') sq-deg) to redshifts z > 1 
yielding catalogs containing hundreds of millions of ob- 
jects. 

The spatial distribution of these objects on different 
scales contains invaluable information that could help 
clarify many open problems in cosmology and astro- 
physics, such as the nature of dark matter and dark en- 
ergy or the pressence of primordial non-Gaussianities in 
the density field. One of the simplest observables that 
can be estimated to quantify the clustering of matter 
on different scales is the two-point correlation function 
(2PCF hereon, see section |ll]) . Its estimation is based on 
counting pairs of objects separated by a given distance 
measure, and therefore its computational time grows 
with the square of the number of objects in the catalog. 
Hence, when 0(10^"'"^^) pairs must be considered, a sim- 
plistic serial approach is too slow for the full-scale prob- 
lem, and, besides using some simplifying approximation, 
the only viable solution becomes parallelising the calcu- 
lation. In this sense modern graphical processing units 
(GPUs) provide the means to perform many operations 
in parallel on a large number (hundreds) of cores with 
a moderate clock frequency for a comparatively cheap 
price. Another approach is using a relatively smaller 
number of high-frequency CPU cores both in shared or 
distributed memory machines. 

Here we present a CUTE (Correlation Utilities and Two- 
point Estimation) , a free open-source code that estimates 
different kinds of two-point correlations from discrete cos- 
mological catalogs using different speed-up techniques. 



II. THE TWO-POINT CORRELATION 
rUNCTION(S) 

The three-dimensional 2PCF ^(r) of a set of discrete 
points in M"^ represents the excess probability of finding 
two of them inside two small volumes dVi and dV2 sepa- 
rated by r [4 : 

{dP) =n[l+^{r)]dVidV2. (1) 

When this point distribution comes from a Poisson pro- 
cess based on an underlying random density field (^(x), 
the field's 2PCF ^^(r) = ((5(x)5(x + r)) is directly re- 
lated to that of the point distribution [5] . Note that even 
though, in principle, the two-point correlation should de- 
pend on the positions of both points, ri and r2, for homo- 
geneous fields the only dependence is on the separation 
between them r = ri — r2 . 

A. Types of correlation functions 

• The 3-D correlation function <^(r, /i) and 
^ (cr, 7r) . Different observational effects such as red- 
shift space distortions or errors in the observed 
redshifts transform what would otherwise be an 
isotropic 2PCF into a function that behaves differ- 
ently along the line of sight and in the transverse 
direction. Two coordinate systems are widely used 
in the literature: the tr — tt and r — fj, schemes (see 
fig. [T]), the relation between both being 

7r = r/z, (T = — TT^. (2) 

The r — /z scheme has the advantage that the usual 
multipole expansion is directly written in terms of 
these variables: 

= (3) 
I 

where Pi are the Legendre polynomials. Note that 
at the linear level and in the plane-parallel approx- 
imation (i.e. the Kaiser formula [6]) only the first 
three even multipoles (/ = 0, 2, 4) contribute. 
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FIG. 1: Definition of tiie different coordinate conventions 
used. 



• The monopole ^o('')- The first element {I = 0) in 
the expansion above is the angle-averaged correla- 
tion function or "monopole": 



foM-- / (4) 



This is the only non-zero contribution in the ab- 
sence of redshift distortions. 

• The radial correlation function ^r{z, Az). Cor- 
relating only pairs of galaxies aligned with the line 
of sight one computes the so-called radial correla- 
tion function, which can be made to depend locally 
only on the redshift difference Az between each pair 
of galaxies. It is related to the three-dimensional 
2PCF as 



^,(z,Az) =C(7r(z,Az),a = 0), 



where 



7r(z, Az) = xiz + Az/2) - xiz - Az/2) 



cAz 



(5) 



, (6) 



where xi^) is the radial comoving distance to red- 
shift z. 

• The angular correlation function w{9). The 
angular correlation function is the 2PCF of the den- 
sity contrast field projected on the sphere 



■w{0) = {Ss{ni)Ss{n2)), cost 
6s{h)= / dz(j){z)S{r{z)h), 



ni-n2, (7) 

(8) 



where (j){z) is the redshift selection function. The 
angular correlation function is related to ^ (r, /i) by 



w{0) = J dzi(l){zi) J dz2 4i{z2)^{r{zi,Z2,9),^i{zi,Z2,9)) 

(9) 

r(zi, Z2,e) = \Jx^{z\) +X^{z2) -2x{z{)x{z2) cos 61, 

fz z e\= \x\z^)-x\z2)\ 

^ V(x2(^i)+x'(^2))2-4x2(zi)x2(z2)cos2(^) 



B. Estimating the 2PCF from discrete data 

When calculating the correlation function from a point 
distribution several observational dificulties arise related 
to the survey selection function (or more generally, to 
the spatial distribution of the data): different parts of 
the sky may have been mapped to different depths and 
the radial (redshift) distribution is never uniform. The 
most usual technique to deal with these issues is to com- 
pare the data catalog with catalogs made of randomly 
distributed objects. Thus, since the correlation function 
is the excess probability, with respect to a random dis- 
tribution, of finding two objects within a given distance, 
it can be naively estimated as 



(10) 



where Nd and iV^ are the number of points in the data 
and random catalogs respectively and DD and RR are 
histograms containing the counts of pairs of objects found 
separated by a given distance in each catalog. This naive 
(or natural) estimator is in fact biased and several other 
ones have been proposed in the literature, making use 
not only of DD and RR but also of the cross-correlation 
of random and data DR. The most widely accepted es- 
timator is the one proposed by Landy & Szalay [7]: 



jVr(jVr-l) 

?LS = 



DD 



- ^^DR + RR 



(11) 



see [S] for a thorough comparison of different estimators. 

The most delicate part of the estimation is in fact being 
able to generate the random catalogs correctly: the back- 
ground spatial distribution, both in angles and redshift 
(i.e. the one-point function), of random objects must be 
exactly the same as in the data. Only then are mask- 
related effects accurately corrected. Also, in order to 
minimize Poisson errors in DR and RR, random cata- 
logs should be generated with more particles than the 
data. 



III. CUTE 

CUTE (Correlation Utilities and Two-point Estimation) 
is a free and open-source code for cosmological 2PCF es- 
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timation. CUTE is written in C and, in the current pub- 
lic version, comes with two implementations: one paral- 
lelized for shared-memory machines using OpenMP and 
one (CU_CUTE) that performs the correlations in a GPU 
using NVidia's CUDA architecture. CUTE calculates 4 dif- 
ferent correlation functions (3-D,monopole, angular and 
radial) with different binning schemes and speed-up tech- 
niques. Here we will explain the parallelization strategies 
followed by CUTE and some details specific to each type of 
2PCF. We refer the reader to the README file accompa- 
nying the latest public version of CUTE for the operational 
options and compilation instructions of the code. In this 
section we assume some basic knowledge of parallel com- 
puting with OpenMP and CUDA by the reader. 

It must be noted that there exist two other codes 
[TO], recently made public, designed to compute an- 
gular correlation functions with GPUs. As in the case 
of CUTE the speed-up factor (about 10^) gained by these 
codes through the use of graphical devices for paralleliza- 
tion clearly makes it worth the effort of adapting CPU 
algorithms to run on GPUs. 



A. Serial approach 

Once the random catalog has been produced, DD, DR 
and RR are computed by autocorrelating or crosscorre- 
lating each pair of catalogs. In a serial code this algo- 
rithm is extremely simple, involving one loop over each 
catalog and performing 3 operations in each iteration: 
calculating the distance between each pair of objects, 
determining the bin corresponding to that distance and 
increasing the histogram count on that bin. The corre- 
sponding C-code would be: 



int histogram [ nbins ] ; 
for ( i=0;i<npl ; { 
for(j=0;j<np2; { 

//Calculate distance between two objects 
double dist=get_dist(xl[i] ,yl[i] ,zl[i] , 

x2[j] ,y2[j] ,z2[j]) ; 
//Calculate bin number 
int ibin=bin_dist ( dist ) ; 
//Increase histogram count 
histogram [ ibin] + + ; 

} 



As we said before, these two nested loops make this an 
N"^ problem (to be precise, an nl*n2 problem), whose 
computational time will grow very fast as we increase 
the size of the catalogs. At this point parallelization or 
some kind of fast approximate method, such as the use 
of fc- Trees or pixels (see section III C 2 ) is desirable if not 
compulsory. The approach taken by CUTE is to parallelize 
one of the two loops, which introduces complications in 
this simple algorithm. 



B. Parallelization with CUDA and OpenMP 

1. Multicore shared-memory machines and OpenMP 

OpenMP [H] is an API that gives support for paral- 
lel programming in shared-memory platforms. Once a 
parallel execution block is opened, the programmer can 
define private (one independent copy per core) or shared 
(common) variables and easily divide for loops between 
all available cores. For a thorough review of the different 
features of OpenMP see [TT] . The serial code above takes 
the following form when parallelized with OpenMP: 

1 int histogram [ nbins ] ; 

2 int histo_thread [ nbins ] ; 

3 T^^pragma omp parallel default ( none ) \ 

4 private ( hthread ) shared (... ) { 

5 //Initialize private histograms 

6 for ( i =0; i<nbins ; i+-|-) 

7 h is t o _t bread [ nbins ] =0 ; 

8 75l!pragma omp for //Parallelize loop 



for ( i =0;i<npl ; i++) { 
for(j=0;j<np2; { 

//Calculate distance between two objects 



} 



} 



double dist=get_dist(xl[i] ,yl[i 
x2[j] ,y2[j: 
//Calculate bin number 
int ib i n=b i n _d i s t ( d i s t ) ; 
//Increase histogram count 
histo_thread [ ibin]-|--|-; 
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20 T^^pragma omp critical { 

21 //Add private histograms 

22 for ( i =0; i<nbins ; i+-|-) 

23 histogram [ i ]-f= hi s to _t hre ad [ i ] 

24 } 

25 } 

The strategy in this case to is to declare one private his- 
togram per execution thread that will store that thread's 
pair counts. The first loop is then divided between 
all available threads and finally all partial histograms 
are added, avoiding read/write collisions, into the final 
shared one. As can be seen, parallelization with OpenMP 
is effortless, only requiring a few extra lines of code. 



2. Graphics cards and CUDA 

A GPU (Graphical Processing Unit) is a specialized 
piece of hardware designed for fast massively parallel 
manipulation of memory addresses. As their name sug- 
gests, GPUs are mainly intended for image building and 
processing, however their highly parallel structure makes 
them ideal for intensive numerical computation, provid- 
ing a relatively cheap flop/s (floating-point operations 
per second). Hence in the last years GPUs have found 
their way into different branches of scientific research, 
computational cosmology being one of them [121 ES] Ini- 
tially the main difficulty when trying to use GPUs for 
scientific computing was the programming of the numer- 
ical algorithm, since using the standard APIs meant that 
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data had to be disguised as pixel colors and some math- 
ematical operations had to be encoded as graphics ren- 
dering. However lately a few programming models for 
GPUs have seen the light of day PU] [H] that make gen- 
eral purpose computing on GPUs (GPGPU) a lot easier. 
Of these we have chosen Nvidia's CUDA [14 for its syn- 
tactic simplicity. 

Two main complications arise when one tries to adapt 
a code to execute on a GPU. First, in a massively parallel 
environment one must take great care to avoid race con- 
ditions due to simultaneous memory read/write processes 
by different threads. Second, unlike in a multi-CPU ma- 
chine, the amount of memory "per thread" available in 
a GPU is very limited, presently of the order of a few 
GB for hundreds of processors. Besides this there are 
other more subtle concerns, such as intra- warp commu- 
nication or the presence of different types of cached and 
uncached memory in the GPU, the correct use of which 
may enhance dramatically the code's performance. In 
summary, correctly parallelising a code with GUDA is 
not as straightforward as it is with OpenMP, and in some 
cases it may not be worth the effort. For an introduction 
to CUDA and its many features see [l5] . 

Implementing the serial algorithm above in CUDA 
would involve executing the following __device__ func- 
tion in every thread in parallel: 

__shared__ int histo-thread [ nbins ] ; 

int s t r i d e=blockDim . X* gridDim . X ; 

//Initialize shared histogram 

histo_thread [threadid .x]=0; 

__syncthreads () ; 

/ / Correlate 

for ( i =0;i<npl ; { 

int j=threadldx . x-|-blockIdx . x* blockDim . x ; 
while (j<np2) { 

//Calculate distance between two objects 
double dist=get_dist(xl[i] ,yl[i] ,zl[i] , 

x2[j] ,y2[j] ,z2[j]) ; 
//Calculate bin number 
int ibin=bin_dist ( dist ) ; 
//Increase histogram count 
atomicAdd (&( histo_thread[ibin]) ,1); 



} 



//Increase second inc 
j-f=s t r i d e ; 



ex by stride 



} 

//Add block histograms 
--Syncthreads () ; 

atomicAdd (&( hist ogram [threadldx.x]) , 
h is t o _t hr e ad [ t hreadldx . X ] ) ; 

As before, we have divided one of the loops (this time the 
second one) among all the execution threads. The first 
difference with respect to OpenMP that we can see in- 
mediately is that, due to the limited amount of memory 
of the GPU, we can only declare one partial histogram 
per block, and not per thread. To do this we declare 
it as a variable in shared memory, which also has the 
advantage of having a lower latency than global mem- 
ory. This introduces a new complication, since now all 
threads in a block will try to add their pair counts to 
the same histogram. This has to be done avoiding race 
conditions by using the GUDA atomicAddO function. 




FIG. 2: Illustration of the method used to calculate the 3-D 
correlation function in CUDA. Since the whole 2-dimensional 
128 X 128 bin histogram cannot be fit inside the device's shared 
memory, it is split into smaller ones, which are filled sepa- 
rately. Although the catalogs have to be correlated more than 
once, the bottleneck caused by atomic operations is largely 
mitigated by the higher number of histogram bins. 



This is in fact the bottleneck of any algorithm involving 
histograms in CUDA (especially if the distribution un- 
der study is very degenerate), since many threads may 
have to remain idle while waiting for other threads to 
update their histogram entries. The best way to pal- 
liate this problem is to use at most as many threads 
per block as histogram bins (in fact, note that the algo- 
rithm above will only work when using as many threads 
as histogram bins, however it can be easily extended to 
more general cases). Finally all the partial block his- 
tograms are summed up into the global histogram in an 
ordered manner using again atomicAddO . The fact that 
CUTE uses CUDA atomic functions such as atomicAddO 
means that it will only run on GPUs that support atomic 
operations (namely compute capability 2.0 or higher). 
There exist general algorithms for histograms that work 
on any CUDA-enabled device [T11|T7], however no perfor- 
mance improvement was observed with respect to using 
atomicAddO. Furthermore, the method above reduces 
the use of shared memory for histograms to a minimum, 
allowing its use for other purposes. 



C. Specifics of the 2PCFs 

In the previous section we have described the general 
strategy followed to parallelize the calculation of any 
2PCF with OpenMP and GUDA. However each of the 
different 2PGFs detailed in section [H] requires a differ- 
ent treatment of the data and maybe allows for different, 
more optimal, approaches. We give the details specific 
to each of these types here. 
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1. Radial correlation function 

As was said in section |ll] the radial 2PCF is calculated 
by correlating pairs of aligned objects and binning them 
according to their relative redshift difference Az. The 
approach taken by CUTE to define aligned pairs is to di- 
vide the sky into small angular pixels and to correlate 
only pairs of objects belonging to the same pixel. For 
reasonable pixel sizes (< 1 sq-deg) the number of objects 
to correlate per pixel is relatively small (0(10^"'^) for 
expected catalog sizes). Hence, since the computational 
time in this case is not an issue, there is no need for mas- 
sive parallelization, and radial correlation functions are 
only supported by CUTE in its OpenMP version. 




# objects 



2. Angular correlation function 

For the calculation of angular 2PCFs CUTE projects all 
objects in the catalog into the unit sphere and correlates 
pairs of objects according to their angular separation 6 
(see figure [ij, which is used as a distance measure. Two 
complementary speed-up techniques can be used by CUTE 
in this case: if one is not interested in extremely small 
angular scales one can create a pixel map from the cata- 
log and then correlate the pixels (weighting each of them 
by the number of objects that fall inside it). This may 
effectively reduce the number of objects that must be cor- 
related by an order of magnitude and therefore reduce the 
computational time by a factor of 100. Also, in the cal- 
culation of the angular separation, the arc-cosine of the 
scalar product of two position vectors must be estimated. 
Calculating the arc-cosine is a very time-consuming op- 
eration, and, if one is not interested in very large angular 
scales, the following approximation can be used. 



arccos(l 



2x 



45 



which is precise to 1 part in 10^* for angles below 40° and 
reduces the computational time by a factor 2. Both 
these time-saving techniques can be swiched on or off in 
CUTE by the user. 



3. 3-D correlation function 

The main difference in the calculation of the 3-D cor- 
relation function with respect to the other 2PCFs is that 
pairs are binned in 2-dimensional histograms, according 
to their (r, /i) or (tt, ct) separations. This does not in- 
troduce any relevant changes in the OpenMP implemen- 
tation, as long as the the amount of shared memory is 
large enough to accomodate one private 2-D histogram 
per thread, however it does matter when adapting the 
code to CUDA. The reason is that currently the amount 
of shared memory per block in CPUs is limited to 48 kB, 



FIG. 3; Computational times employed by different devices 
to compute the monopole 2PCF of catalogs of different sizes. 
A speed-up factor of 0(100) can be gained by using a high- 
end GPU with respect to a sequential approach on a high-end 
CPU. Even with a regular gaming GPU the increase in speed 
is substantial (0(10)). The different devices are described in 
table B 



which is too little to allocate, for example, a 128 x 128 ar- 
ray of long integers. The solution to this problem chosen 
for CUTE is explained in fig. [2} the catalogs are corre- 
lated several times, each time binning only pairs whose 
separation in one of the two coordinates is within a given 
range, until the whole histogram is filled. Thus we can de- 
clare smaller 2-D histograms in shared memory, and even 
though the catalogs must be correlated several times, the 
histogram-filling bottleneck mentioned in section |IIIB 2| 
is largely alleviated by the higher number of histogram 
entries, thus conserving a reasonable computational time 
(see section IV). 



IV. PERFORMANCE 

We have tested CUTE's performance in terms of com- 
putational time by running it on platforms with differ- 
ent capabilities, listed in table |lj The serial version was 
tested by running CUTE on a single CPU core. We also 
tested the OpenMP version on a dual-core laptop and 
on a large shared-memory machine with 80 cores. The 
CUDA version has been tried on a regular graphics card 
in a laptop and on a high-end GPU. For this test the 
monopole 2PCF was calculated for catalogs of different 
sizes in the range 10'^ — 10^. The computational times 
for one single correlation (i.e. just calculating, for ex- 
ample, DR) in the 5 different platforms are plotted in 
figure [3j As expected, using CPUs or parallelising the 
computation on several CPU cores improves the code's 
speed by a factor 10 - 100, even using a regular video- 
game graphics card. The ellapsed times were measured 
using OpenMP and CUDA timing functions, since these 
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Name 


Description 


#cores 


CPUs 


Sequential 
Laptop-MP 
Server-MP 


Intel Core i7-2620M 
Intel Core i7-2620M 
Intel MP NEHALEM-EX (x8) 


1 core (= 1 thread) 
2 cores (= 4 threads) 
80 cores (= 160 threads) 


CPUs 


Laptop-GPU 
Server-GPU 


NVIDIA NVS 4200M 
NVIDIA TESLA C2070 FERMI 


48 CUDA cores 
448 CUDA cores 



TABLE I: Different devices in which CUTE has been tested: a single CPU core, a dual core, a multi-core shared-memory machine 
(160 threads), an ordinary graphics card and a high-end GPU. 



Platform 




r(Ciog(r)) 


Tiw{e)) 


TKi^e)) 




Sequential 


877 


5230 


1374 


21 


2238 


Laptop-MP 


389 


2676 


628 


5.3 


1064 


Laptop-GPU 


113 


185 


283 


6.2 


297 


Server-MP 


25 


52 


32 


0.51 


50 


Server-GPU 


13 


20 


22 


0.46 


27 



TABLE II: Computational times ellapsed, for each of the 5 platforms listed in table|T] during the calculation of 5 different 2PCFs: 
monopole (^(r)), monopole with logarithmic binning (Ciog('~)), angular (ui(S)), angular with pixels of resolution Afl = 5 x 10^^ 
sq-deg (u'pix(S)) and 3-D (^(cr, tt)). Times are in seconds and correspond to the calculation of the DR histogram (the full 
calculation of the 2PCF should take 2-3 times longer). 



give the most accurate estimate of the time spent doing 
the actual correlation. 

For completeness we have also listed in table |TT] the 
computational times taken by the 5 different devices to 
calculate different correlation functions. The dataset 
used for this exercise is a subset of one of the mock 
catalogs provided by the MICE project [5^ [T^, with 
0° < dec < 18°, 0° < R.A. < 18°, 0.5 < z < 0.6, con- 
taining ~ 3 X 10^ particles. The 5 different correlation 
functions are: 

• Monopole correlation function: linear binning for 
r < lOOMpc/ft. and 256 bins. 

• Monopole correlation function: logarithmic binning 
for r < 100Mpc//i using 256 bins and 50 bins per 
decade. 

• Angular correlation function: linear binning for 9 < 
30° and 256 bins. Calculated by brute-force. 

• Angular correlation function: linear binning for 
6 < 30° and 256 bins. Calculated using pixels with 
resolution Ail = 5 x 10^^ sq-deg. 

• 3-D correlation function: binning in (tt, a) on a 64 x 
64-bin histogram. 

Examples of the output produced by CUTE for the dif- 
ferent kinds of correlation functions can be seen in figures 



Eland! 

V. SUMMARY 

We have presented CUTE, a parallel code for computing 
two-point correlation functions from cosmological cata- 
logs. CUTE has been optimized to run on shared-memory 
machines as well as graphical processing units. It can es- 
timate the 3-D, monopole, radial and angular correlation 
functions from a set of data using different speed-up tech- 
niques and binning schemes. We have shown that great 
benefits in terms of computational speed can be gained 
by parallelising the algorithm on CPUs. 

The code is publicly available through our web- 
site |htt p: //members ■ if t .uam-csic . es/dmonge/CUTE7] 
'html, CUTE is released under the GNU Public License 
(GPL). 
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FIG. 5; 3-D correlation function calculated from the MICE mock catalog. 



